{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3ce451e9-f8f1-4f27-8c6b-4a93a406d504",
   "metadata": {},
   "source": [
    "# Identity-enabled RAG using PebbloRetrievalQA\n",
    "\n",
    "> PebbloRetrievalQA is a Retrieval chain with Identity & Semantic Enforcement for question-answering\n",
    "against a vector database.\n",
    "\n",
    "This notebook covers how to retrieve documents using Identity & Semantic Enforcement (Deny Topics/Entities).\n",
    "For more details on Pebblo and its SafeRetriever feature visit [Pebblo documentation](https://daxa-ai.github.io/pebblo/retrieval_chain/)\n",
    "\n",
    "### Steps:\n",
    "\n",
    "1. **Loading Documents:**\n",
    "We will load documents with authorization and semantic metadata into an in-memory Qdrant vectorstore. This vectorstore will be used as a retriever in PebbloRetrievalQA. \n",
    "\n",
 **Note:** It is recommended">
    "> **Note:** It is recommended to use [PebbloSafeLoader](https://daxa-ai.github.io/pebblo/rag) as the counterpart for loading documents with authorization and semantic metadata on the ingestion side. `PebbloSafeLoader` guarantees the secure and efficient loading of documents while maintaining the integrity of the metadata.\n",
    "\n",
    "\n",
    "\n",
    "2. **Testing Enforcement Mechanisms**:\n",
    "   We will test Identity and Semantic Enforcement separately. For each use case, we will define a specific \"ask\" function with the required contexts (*auth_context* and *semantic_context*) and then pose our questions.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ee16b6b-5dac-4b5c-bb69-3ec87398a33c",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "### Dependencies\n",
    "\n",
    "We'll use an OpenAI LLM, OpenAI embeddings and a Qdrant vector store in this walkthrough.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e68494fa-f387-4481-9a6c-58294865d7b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --upgrade --quiet langchain langchain_core langchain-community langchain-openai qdrant_client"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61498d51-0c38-40e2-adcd-19dfdf4d37ef",
   "metadata": {},
   "source": [
    "### Identity-aware Data Ingestion\n",
    "\n",
    "Here we are using Qdrant as a vector database; however, you can use any of the supported vector databases.\n",
    "\n",
    "**PebbloRetrievalQA chain supports the following vector databases:**\n",
    "- Qdrant\n",
    "- Pinecone\n",
    "- Postgres(utilizing the pgvector extension)\n",
    "\n",
    "\n",
    "**Load vector database with authorization and semantic information in metadata:**\n",
    "\n",
    "In this step, we capture the authorization and semantic information of the source document into the `authorized_identities`, `pebblo_semantic_topics`, and `pebblo_semantic_entities` fields within the metadata of the VectorDB entry for each chunk. \n",
    "\n",
    "\n",
    "*NOTE: To use the PebbloRetrievalQA chain, you must always place authorization and semantic metadata in the specified fields. These fields must contain a list of strings.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ae4fcbc1-bdc3-40d2-b2df-8c82cad1f89c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Vectordb loaded.\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.vectorstores.qdrant import Qdrant\n",
    "from langchain_core.documents import Document\n",
    "from langchain_openai.embeddings import OpenAIEmbeddings\n",
    "from langchain_openai.llms import OpenAI\n",
    "\n",
    "llm = OpenAI()\n",
    "embeddings = OpenAIEmbeddings()\n",
    "collection_name = \"pebblo-identity-and-semantic-rag\"\n",
    "\n",
    "page_content = \"\"\"\n",
    "**ACME Corp Financial Report**\n",
    "\n",
    "**Overview:**\n",
    "ACME Corp, a leading player in the merger and acquisition industry, presents its financial report for the fiscal year ending December 31, 2020. \n",
    "Despite a challenging economic landscape, ACME Corp demonstrated robust performance and strategic growth.\n",
    "\n",
    "**Financial Highlights:**\n",
    "Revenue soared to $50 million, marking a 15% increase from the previous year, driven by successful deal closures and expansion into new markets. \n",
    "Net profit reached $12 million, showcasing a healthy margin of 24%.\n",
    "\n",
    "**Key Metrics:**\n",
    "Total assets surged to $80 million, reflecting a 20% growth, highlighting ACME Corp's strong financial position and asset base. \n",
    "Additionally, the company maintained a conservative debt-to-equity ratio of 0.5, ensuring sustainable financial stability.\n",
    "\n",
    "**Future Outlook:**\n",
    "ACME Corp remains optimistic about the future, with plans to capitalize on emerging opportunities in the global M&A landscape. \n",
    "The company is committed to delivering value to shareholders while maintaining ethical business practices.\n",
    "\n",
    "**Bank Account Details:**\n",
    "For inquiries or transactions, please refer to ACME Corp's US bank account:\n",
    "Account Number: 123456789012\n",
    "Bank Name: Fictitious Bank of America\n",
    "\"\"\"\n",
    "\n",
    "documents = [\n",
    "    Document(\n",
    "        **{\n",
    "            \"page_content\": page_content,\n",
    "            \"metadata\": {\n",
    "                \"pebblo_semantic_topics\": [\"financial-report\"],\n",
    "                \"pebblo_semantic_entities\": [\"us-bank-account-number\"],\n",
    "                \"authorized_identities\": [\"finance-team\", \"exec-leadership\"],\n",
    "                \"page\": 0,\n",
    "                \"source\": \"https://drive.google.com/file/d/xxxxxxxxxxxxx/view\",\n",
    "                \"title\": \"ACME Corp Financial Report.pdf\",\n",
    "            },\n",
    "        }\n",
    "    )\n",
    "]\n",
    "\n",
    "vectordb = Qdrant.from_documents(\n",
    "    documents,\n",
    "    embeddings,\n",
    "    location=\":memory:\",\n",
    "    collection_name=collection_name,\n",
    ")\n",
    "\n",
    "print(\"Vectordb loaded.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f630bb8b-67ba-41f9-8715-76d006207e75",
   "metadata": {},
   "source": [
    "## Retrieval with Identity Enforcement\n",
    "\n",
    "PebbloRetrievalQA chain uses a SafeRetrieval to enforce that the snippets used for in-context are retrieved only from the documents authorized for the user. \n",
    "To achieve this, the Gen-AI application needs to provide an authorization context for this retrieval chain. \n",
    "This *auth_context* should be filled with the identity and authorization groups of the user accessing the Gen-AI app.\n",
    "\n",
    "\n",
    "Here is the sample code for the `PebbloRetrievalQA` with `user_auth`(List of user authorizations, which may include their User ID and \n",
    "    the groups they are part of) from the user accessing the RAG application, passed in `auth_context`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "e978bee6-3a8c-459f-ab82-d380d7499b36",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.chains import PebbloRetrievalQA\n",
    "from langchain_community.chains.pebblo_retrieval.models import AuthContext, ChainInput\n",
    "\n",
    "# Initialize PebbloRetrievalQA chain\n",
    "qa_chain = PebbloRetrievalQA.from_chain_type(\n",
    "    llm=llm,\n",
    "    retriever=vectordb.as_retriever(),\n",
    "    app_name=\"pebblo-identity-rag\",\n",
    "    description=\"Identity Enforcement app using PebbloRetrievalQA\",\n",
    "    owner=\"ACME Corp\",\n",
    ")\n",
    "\n",
    "\n",
    "def ask(question: str, auth_context: dict):\n",
    "    \"\"\"\n",
    "    Ask a question to the PebbloRetrievalQA chain\n",
    "    \"\"\"\n",
    "    auth_context_obj = AuthContext(**auth_context) if auth_context else None\n",
    "    chain_input_obj = ChainInput(query=question, auth_context=auth_context_obj)\n",
    "    return qa_chain.invoke(chain_input_obj.dict())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a267e96-70cb-468f-b830-83b65e9b7f6f",
   "metadata": {},
   "source": [
    "### 1. Questions by Authorized User\n",
    "\n",
    "We ingested data for authorized identities `[\"finance-team\", \"exec-leadership\"]`, so a user with the authorized identity/group `finance-team` should receive the correct answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "2688fc18-1eac-45a5-be55-aabbe6b25af5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Question: Share the financial performance of ACME Corp for the year 2020\n",
      "\n",
      "Answer: \n",
      "Revenue: $50 million (15% increase from previous year)\n",
      "Net profit: $12 million (24% margin)\n",
      "Total assets: $80 million (20% growth)\n",
      "Debt-to-equity ratio: 0.5\n"
     ]
    }
   ],
   "source": [
    "auth = {\n",
    "    \"user_id\": \"finance-user@acme.org\",\n",
    "    \"user_auth\": [\n",
    "        \"finance-team\",\n",
    "    ],\n",
    "}\n",
    "\n",
    "question = \"Share the financial performance of ACME Corp for the year 2020\"\n",
    "resp = ask(question, auth)\n",
    "print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4db6566-6562-4a49-b19c-6d99299b374e",
   "metadata": {},
   "source": [
    "### 2. Questions by Unauthorized User\n",
    "\n",
    "Since the user's authorized identity/group `eng-support` is not included in the authorized identities `[\"finance-team\", \"exec-leadership\"]`, we should not receive an answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "2d736ce3-6e05-48d3-a5e1-fb4e7cccc1ee",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Question: Share the financial performance of ACME Corp for the year 2020\n",
      "\n",
      "Answer:  I don't know.\n"
     ]
    }
   ],
   "source": [
    "auth = {\n",
    "    \"user_id\": \"eng-user@acme.org\",\n",
    "    \"user_auth\": [\n",
    "        \"eng-support\",\n",
    "    ],\n",
    "}\n",
    "\n",
    "question = \"Share the financial performance of ACME Corp for the year 2020\"\n",
    "resp = ask(question, auth)\n",
    "print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33a8afe1-3071-4118-9714-a17cba809ee4",
   "metadata": {},
   "source": [
    "### 3. Using PromptTemplate to provide additional instructions\n",
    "You can use PromptTemplate to provide additional instructions to the LLM for generating a custom response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "59c055ba-fdd1-48c6-9bc9-2793eb47438d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.prompts import PromptTemplate\n",
    "\n",
    "prompt_template = PromptTemplate.from_template(\n",
    "    \"\"\"\n",
    "Answer the question using the provided context. \n",
    "If no context is provided, just say \"I'm sorry, but that information is unavailable, or Access to it is restricted.\".\n",
    "\n",
    "Question: {question}\n",
    "\"\"\"\n",
    ")\n",
    "\n",
    "question = \"Share the financial performance of ACME Corp for the year 2020\"\n",
    "prompt = prompt_template.format(question=question)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4d27c00-73d9-4ce8-bc70-29535deaf0e2",
   "metadata": {},
   "source": [
    "#### 3.1 Questions by Authorized User"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "e68a13a4-b735-421d-9655-2a9a087ba9e5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Question: Share the financial performance of ACME Corp for the year 2020\n",
      "\n",
      "Answer: \n",
      "Revenue soared to $50 million, marking a 15% increase from the previous year, and net profit reached $12 million, showcasing a healthy margin of 24%. Total assets also grew by 20% to $80 million, and the company maintained a conservative debt-to-equity ratio of 0.5.\n"
     ]
    }
   ],
   "source": [
    "auth = {\n",
    "    \"user_id\": \"finance-user@acme.org\",\n",
    "    \"user_auth\": [\n",
    "        \"finance-team\",\n",
    "    ],\n",
    "}\n",
    "resp = ask(prompt, auth)\n",
    "print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b97a9ca-bdc6-400a-923d-65a8536658be",
   "metadata": {},
   "source": [
    "#### 3.2 Questions by Unauthorized Users"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "438e48c6-96a2-4d5e-81db-47f8c8f37739",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Question: Share the financial performance of ACME Corp for the year 2020\n",
      "\n",
      "Answer: \n",
      "I'm sorry, but that information is unavailable, or Access to it is restricted.\n"
     ]
    }
   ],
   "source": [
    "auth = {\n",
    "    \"user_id\": \"eng-user@acme.org\",\n",
    "    \"user_auth\": [\n",
    "        \"eng-support\",\n",
    "    ],\n",
    "}\n",
    "resp = ask(prompt, auth)\n",
    "print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4306cab3-d070-405f-a23b-5c6011a61c50",
   "metadata": {},
   "source": [
    "## Retrieval with Semantic Enforcement"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c3757cf-832f-483e-aafe-cb09b5130ec0",
   "metadata": {},
   "source": [
    "The PebbloRetrievalQA chain uses SafeRetrieval to ensure that the snippets used in context are retrieved only from documents that comply with the\n",
    "provided semantic context.\n",
    "To achieve this, the Gen-AI application must provide a semantic context for this retrieval chain.\n",
    "This `semantic_context` should include the topics and entities that should be denied for the user accessing the Gen-AI app.\n",
    "\n",
    "Below is a sample code for PebbloRetrievalQA with `topics_to_deny` and `entities_to_deny`. These are passed in `semantic_context` to the chain input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "daf37bf7-9a16-4102-8893-5b698cae1b07",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List, Optional\n",
    "\n",
    "from langchain_community.chains import PebbloRetrievalQA\n",
    "from langchain_community.chains.pebblo_retrieval.models import (\n",
    "    ChainInput,\n",
    "    SemanticContext,\n",
    ")\n",
    "\n",
    "# Initialize PebbloRetrievalQA chain\n",
    "qa_chain = PebbloRetrievalQA.from_chain_type(\n",
    "    llm=llm,\n",
    "    retriever=vectordb.as_retriever(),\n",
    "    app_name=\"pebblo-semantic-rag\",\n",
    "    description=\"Semantic Enforcement app using PebbloRetrievalQA\",\n",
    "    owner=\"ACME Corp\",\n",
    ")\n",
    "\n",
    "\n",
    "def ask(\n",
    "    question: str,\n",
    "    topics_to_deny: Optional[List[str]] = None,\n",
    "    entities_to_deny: Optional[List[str]] = None,\n",
    "):\n",
    "    \"\"\"\n",
    "    Ask a question to the PebbloRetrievalQA chain\n",
    "    \"\"\"\n",
    "    semantic_context = dict()\n",
    "    if topics_to_deny:\n",
    "        semantic_context[\"pebblo_semantic_topics\"] = {\"deny\": topics_to_deny}\n",
    "    if entities_to_deny:\n",
    "        semantic_context[\"pebblo_semantic_entities\"] = {\"deny\": entities_to_deny}\n",
    "\n",
    "    semantic_context_obj = (\n",
    "        SemanticContext(**semantic_context) if semantic_context else None\n",
    "    )\n",
    "    chain_input_obj = ChainInput(query=question, semantic_context=semantic_context_obj)\n",
    "    return qa_chain.invoke(chain_input_obj.dict())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9718819b-f5cd-4212-9947-d18cd507c8b7",
   "metadata": {},
   "source": [
    "### 1. Without semantic enforcement\n",
    "\n",
    "Since no semantic enforcement is applied, the system should return the answer without excluding any context due to the semantic labels associated with the context.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "69158be1-f223-4d14-b61f-f4afdf5af526",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Topics to deny: []\n",
      "Entities to deny: []\n",
      "Question: Share the financial performance of ACME Corp for the year 2020\n",
      "Answer: \n",
      "Revenue for ACME Corp increased by 15% to $50 million in 2020, with a net profit of $12 million and a strong asset base of $80 million. The company also maintained a conservative debt-to-equity ratio of 0.5.\n"
     ]
    }
   ],
   "source": [
    "topic_to_deny = []\n",
    "entities_to_deny = []\n",
    "question = \"Share the financial performance of ACME Corp for the year 2020\"\n",
    "resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)\n",
    "print(\n",
    "    f\"Topics to deny: {topic_to_deny}\\nEntities to deny: {entities_to_deny}\\n\"\n",
    "    f\"Question: {question}\\nAnswer: {resp['result']}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8789c58-0d64-404e-bc09-92f6952022ac",
   "metadata": {},
   "source": [
    "### 2. Deny financial-report topic\n",
    "\n",
    "Data has been ingested with the topics: `[\"financial-report\"]`.\n",
    "Therefore, an app that denies the `financial-report` topic should not receive an answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "9b17b2fc-eefb-4229-a41e-2f943d2eb48e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Topics to deny: ['financial-report']\n",
      "Entities to deny: []\n",
      "Question: Share the financial performance of ACME Corp for the year 2020\n",
      "Answer: \n",
      "\n",
      "Unfortunately, I do not have access to the financial performance of ACME Corp for the year 2020.\n"
     ]
    }
   ],
   "source": [
    "topic_to_deny = [\"financial-report\"]\n",
    "entities_to_deny = []\n",
    "question = \"Share the financial performance of ACME Corp for the year 2020\"\n",
    "resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)\n",
    "print(\n",
    "    f\"Topics to deny: {topic_to_deny}\\nEntities to deny: {entities_to_deny}\\n\"\n",
    "    f\"Question: {question}\\nAnswer: {resp['result']}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "894f21b0-2913-4ef6-b5ed-cbca8f74214d",
   "metadata": {},
   "source": [
    "### 3. Deny us-bank-account-number entity\n",
    "Since the entity `us-bank-account-number` is denied, the system should not return the answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "2b8abce3-7af3-437f-8999-2866a4b9beda",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Topics to deny: []\n",
      "Entities to deny: ['us-bank-account-number']\n",
      "Question: Share the financial performance of ACME Corp for the year 2020\n",
      "Answer:  I don't have information about ACME Corp's financial performance for 2020.\n"
     ]
    }
   ],
   "source": [
    "topic_to_deny = []\n",
    "entities_to_deny = [\"us-bank-account-number\"]\n",
    "question = \"Share the financial performance of ACME Corp for the year 2020\"\n",
    "resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)\n",
    "print(\n",
    "    f\"Topics to deny: {topic_to_deny}\\nEntities to deny: {entities_to_deny}\\n\"\n",
    "    f\"Question: {question}\\nAnswer: {resp['result']}\"\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
